Recognizing Documents versus Meta-Documents by Tree Kernel Learning

Authors

  • Boris A. Galitsky
  • Nina Lebedeva
Abstract

The problem of classifying text with respect to metalanguage and object-language patterns is formulated, and its application areas are proposed. Examples of metalanguage patterns in text are foreign language grammar lessons and tutorials on how to write engineering documents. The method targets text classification tasks where keyword statistics are insufficient to distinguish between such abstract classes of text as metalanguage and object-level. To do that, we extend the parse tree kernel method from the level of individual sentences to the level of paragraphs. We build a set of extended trees for a paragraph of text by merging the individual parse trees of its sentences. We evaluate our approach in the domain of design documents, differentiating them from meta-documents such as instructions on how to write design documents.
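The idea of moving a parse tree kernel from sentences to paragraphs can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how sentence trees are merged, so the sketch simply assumes they are attached under a hypothetical `PARAGRAPH` root, and it uses the standard subset-tree counting recursion with a `decay` parameter chosen here for illustration. Trees are plain nested tuples `(label, child, ...)` with words as string leaves; all function names are illustrative.

```python
# Trees are nested tuples: (label, child1, child2, ...); leaves are word strings.

def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return () if isinstance(node, str) else node[1:]

def production(node):
    # A node's label together with the labels of its direct children.
    return (label(node),) + tuple(label(c) for c in children(node))

def internal_nodes(tree):
    # All non-leaf nodes of the tree, in pre-order.
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in children(tree):
        nodes.extend(internal_nodes(child))
    return nodes

def common_subtrees(n1, n2, decay=1.0):
    # Weighted count of tree fragments rooted at both n1 and n2.
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(children(n1), children(n2)):
        if isinstance(c1, str) or isinstance(c2, str):
            continue  # word leaves are already covered by the production match
        score *= 1.0 + common_subtrees(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    # Sum the fragment counts over every pair of internal nodes.
    return sum(
        common_subtrees(n1, n2, decay)
        for n1 in internal_nodes(t1)
        for n2 in internal_nodes(t2)
    )

def merge_paragraph(sentence_trees):
    # ASSUMPTION: join sentence trees under a made-up PARAGRAPH root.
    return ("PARAGRAPH",) + tuple(sentence_trees)

if __name__ == "__main__":
    s1 = ("NP", ("DT", "the"), ("NN", "design"))
    s2 = ("NP", ("DT", "a"), ("NN", "design"))
    paragraph = merge_paragraph([s1, s2])
    print(tree_kernel(s1, s2))
    print(tree_kernel(paragraph, paragraph))
```

Because the merged tree is an ordinary tree, the same kernel function applies unchanged at the paragraph level, and fragments that span the hypothetical root contribute to the similarity in addition to the per-sentence fragments. A kernel of this kind could then be passed to any kernel-based classifier, such as a support vector machine.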


Similar resources

Text Classification into Abstract Classes Based on Discourse Structure

The problem of classifying text with respect to belonging to a document or a meta-document is formulated and its application areas are proposed. An algorithm is proposed for document classification tasks where word counts are insufficient to differentiate between such abstract classes of text as metalanguage and object-level. We extend the parse tree kernel method from the level of individua...


A Composite Kernel Approach for Detecting Interactive Segments in Chinese Topic Documents

Discovering the interactions between persons mentioned in a set of topic documents can help readers construct the background of a topic and facilitate comprehension. In this paper, we propose a rich interactive tree structure to represent syntactic, content, and semantic information in text. We also present a composite kernel classification method that integrates the tree structure with a bigra...


Text Representation for Automatic Text Categorization

Automatic Text Categorization (ATC), the automatic assignment of text documents to predefined classes, is a language engineering task very relevant to a number of applications, including automatic content and knowledge management in corporations and the Internet, information access and filtering, etc. With first works dating back to 60’s [14], and increased work in the last decade (see the surv...


Variants of Tree Kernels for XML Documents

In this paper, we discuss tree kernels that can be applied for the classification of XML documents based on their DOM trees. DOM trees are ordered trees, in which every node might be labeled by a vector of attributes including its XML tag and the textual content. We describe four new kernels suitable for this kind of trees: a tree kernel derived from the well-known parse tree kernel, the set tr...


Classification of Documents Based on the Structure of Their DOM Trees

In this paper, we discuss kernels that can be applied for the classification of XML documents based on their DOM trees. DOM trees are ordered trees in which every node might be labeled by a vector of attributes including its XML tag and the textual content. We describe five new kernels suitable for such structures: a kernel based on predefined structural features, a tree kernel derived from the...




Publication date: 2015